LIKELIHOOD-BASED MODIFICATION OF
EXPERIMENTAL CRYSTAL STRUCTURE ELECTRON DENSITY MAPS 



RELATED APPLICATIONS 
This application claims the benefit of U.S. provisional patent application S.N. 
60/135,252, filed May 21, 1999. 

STATEMENT REGARDING FEDERAL RIGHTS 
This invention was made with government support under Contract No. W- 
7405-ENG-36 awarded by the U.S. Department of Energy. The government has 
certain rights in the invention. 

FIELD OF THE INVENTION 
The present invention relates generally to the determination of crystal 
structure from the analysis of diffraction patterns, and, more particularly, to 
macromolecular crystallography. 

BACKGROUND OF THE INVENTION 
The determination of macromolecular structures, e.g., proteins, by X-ray 
crystallography is a powerful tool for understanding the arrangement and function of 
such macromolecules. Very powerful experimental methods exist for determining 
crystallographic features, e.g., structure factors and phases. While the structure 
factor amplitudes can be determined quite well, it is frequently necessary to improve 
or extend the phases before a realistic atomic model of the macromolecule, such as 
an electron density map, can be built. 

Many methods have been developed for improving the phases by modifying 
initial experimental electron density maps with prior knowledge of characteristics 
expected in these maps. The fundamental basis of density modification methods is 
that there are many possible sets of structure factors (amplitudes and phases) that
are all reasonably probable based on the limited experimental data that is obtained 
from a particular experiment, and those structure factors that lead to maps that are 
most consistent with both the experimental data and the prior knowledge are the 
most likely overall. In these methods, the choice of prior information that is to be 
used, and the procedure for combining prior information about electron density with 
experimentally-derived phase information are important features. 

Until recently, electron density modification has generally been carried out in 
a two-step procedure that is iterated until convergence. In the first step, an electron 
density map obtained experimentally is modified in real space in order to make it 
consistent with expectations. The modification can consist of, e.g., flattening 
solvent regions, averaging non-crystallographic symmetry-related regions, or 
histogram-matching. In the second step, phases are calculated from the modified 
map and are combined with the experimental phases to form a new phase set. 

The disadvantage of this real-space modification approach is that it is not at
all clear how to weight the observed phases relative to those obtained from the modified
map. This is because the modified map contains some of the same information as
the original map and some new information. This has been recognized for a long 
time and a number of approaches have been designed to improve the relative 
weighting from these two sources, including the use of maximum-entropy methods, 
the use of weighting optimized using cross-validation, and "solvent-flipping." 

A comprehensive theory of the phase problem in X-ray crystallography and a 
formalism for solving it based on maximum entropy and maximum likelihood 
methods have been presented by Bricogne, Acta Cryst. A40, pp. 410-445 (1984) and
Bricogne, Acta Cryst. A44, pp. 517-545 (1988). This formalism describes the 
contents of a crystal in terms of a collection of point atoms along with probabilities 
for their positions. From the positions of these atoms, crystallographic structure 
factors can be calculated, with a certainty depending on the certainties of the 
positions of the atoms. Extensions of the formalism are described in Bricogne 
(1988). The extended formalism specifically addresses the situation encountered in 
crystals of macromolecules in which defined solvent and macromolecule regions
exist in the crystallographic unit cell, and formulas for calculating probabilities of
structure factors based on the presence of "flat" solvent regions are presented
(Bricogne, 1988). The implementation of this formalism is not straightforward
according to Xiang et al., Acta Cryst. D49, pp. 193-212 (1993), who point out that a
full-fledged implementation of this approach would be highly desirable and would
provide a statistical technique for enforcing solvent flatness in advance. Xiang et al.
(1993) report that they settled for an approximation in which solvent flatness outside
the envelope is imposed after the calculation of a model for the distribution of
atoms, which corresponds to the existing procedure of flattening the solvent in an
electron density map (Wang, Methods Enzymol. 115, pp. 90-112 (1985)).

The present invention solves the same problem that earlier procedures 
proposed by Bricogne (1988) address, and also includes the use of likelihood as a 
basis for choosing optimal crystallographic structure factors. The assumptions used
in the present procedure differ substantially from those used by Bricogne (1988).
For treatment of solvent and macromolecule (protein) regions in a crystal, Bricogne
develops statistical relationships among structure factors based on a model of the
contents of the crystal in which point atoms are randomly located, but in which
atoms in the protein region are sharply-defined with low thermal parameters and
atoms in the solvent region are diffuse, with high thermal parameters. In the
present approach, no assumptions about the presence of atoms or possible values
of thermal factors are used. Instead, it is assumed that values of electron density in
the protein and solvent regions, respectively, are distributed in the same way in the
crystal as in a model calculation of a crystal that may or may not be composed of
discrete atoms.

The methods used to find likely solutions to the phase problem are also very 
different in the present approach compared to that of Bricogne (1988) because the 
assumptions used require the problem to be set up in different ways. Bricogne
(1988) applies a maximum-entropy formalism developed by Bricogne (1984) to find
likely arrangements of atoms in the crystal, which in turn can be used to calculate
the arrangement of electron density in the crystal. In the present method, likely
values of the structure factors are found by applying a likelihood-based approach 
based on a combination of experimental information and the likelihood of resulting 
electron density maps. These structure factors can be used to calculate an electron 
density map that is then, in turn, a likely arrangement of electron density in the 
crystal. 

Various objects, advantages and novel features of the invention will be set 
forth in part in the description which follows, and in part will become apparent to 
those skilled in the art upon examination of the following or may be learned by 
practice of the invention. The objects and advantages of the invention may be 
realized and attained by means of the instrumentalities and combinations 
particularly pointed out in the appended claims. 

SUMMARY OF THE INVENTION
In accordance with the purposes of the present invention, as embodied and
broadly described herein, the present invention includes a method for improving an
electron density map of an experimental crystal structure. A likelihood of a set of
structure factors $\{F_h\}$ is formed for the experimental crystal structure as (1) the
likelihood of having obtained an observed set of structure factors $\{F_h^{OBS}\}$ if structure
factor set $\{F_h\}$ was correct, and (2) the likelihood that an electron density map
resulting from $\{F_h\}$ is consistent with selected prior knowledge about the
experimental crystal structure. The set of structure factors is then adjusted to
maximize the likelihood of $\{F_h\}$ for the experimental crystal structure.

BRIEF DESCRIPTION OF THE DRAWINGS 
The accompanying drawings, which are incorporated in and form a part of 
the specification, illustrate embodiments of the present invention and, together with 
the description, serve to explain the principles of the invention. In the drawings: 
FIGURE 1 is a flow sheet for a process to obtain characteristics from a model
electron density map.

FIGURE 2 is a flow sheet for a process to derive structure factors consistent
with experimental results which result in an electron density map with expected
characteristics.

FIGURE 3A is a computer-generated electron density map provided by
SOLVE software and calculated using only one substituted selenium atom.

FIGURE 3B is a computer-generated model electron density map calculated
from an atomic model of the selected protein.

FIGURE 3C is a computer-generated electron density map derived from the
process shown in FIGURES 1 and 2.

FIGURE 3D is a computer-generated electron density map derived from
alternate available software called "dm".



In accordance with the present invention, experimental phase information is
combined with prior knowledge about expected electron density distribution in maps
by maximizing a combined likelihood function. The fundamental idea is to express
knowledge about the probability of a set of structure factors $\{F_h\}$ ($F_h$ includes
amplitude, $|F_h|$, and phase, $\phi_h$, factors) in terms of two quantities: (1) the
likelihood of having measured the observed set of structure factors $\{F_h^{OBS}\}$ if this
structure factor set $\{F_h\}$ were correct; and (2) the likelihood that the map resulting
from this structure factor set $\{F_h\}$ is consistent with prior knowledge about the
structure under observation and other macromolecular structures. The index factor
h is defined in terms of the hkl plane and unit vectors a*, b*, c* in reciprocal lattice
space as h = ha* + kb* + lc*.
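
As an illustration of the index vector h = ha* + kb* + lc*, the following minimal sketch builds the reciprocal basis from direct-space cell vectors and evaluates h for a given (h, k, l). The function names are hypothetical, the crystallographic convention without a factor of 2π is assumed, and the interplanar spacing d = 1/|h| is a standard relation added here only as a sanity check.

```python
import math

def cross(u, v):
    """Cross product of two 3-vectors."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

def reciprocal_basis(a, b, c):
    """Return (a*, b*, c*) for direct-space cell vectors a, b, c (no 2*pi factor)."""
    volume = dot(a, cross(b, c))
    a_star = tuple(x / volume for x in cross(b, c))
    b_star = tuple(x / volume for x in cross(c, a))
    c_star = tuple(x / volume for x in cross(a, b))
    return a_star, b_star, c_star

def index_vector(hkl, basis):
    """Reciprocal-space vector h = h a* + k b* + l c*."""
    h, k, l = hkl
    a_star, b_star, c_star = basis
    return tuple(h * a_star[i] + k * b_star[i] + l * c_star[i] for i in range(3))

def d_spacing(hkl, basis):
    """Interplanar spacing d = 1 / |h| for the (hkl) planes."""
    vec = index_vector(hkl, basis)
    return 1.0 / math.sqrt(dot(vec, vec))

# Hypothetical orthorhombic cell with edges 10, 20 and 30 Angstrom units.
basis = reciprocal_basis((10.0, 0.0, 0.0), (0.0, 20.0, 0.0), (0.0, 0.0, 30.0))
```

For an orthogonal cell the reciprocal vectors are simply the inverse edge lengths along each axis, which makes the result easy to check by hand.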

When formulated in this manner, the overlap of information that occurred in 
the real-space modification methods is not present because the experimental and 
prior information are kept separate. Consequently, proper weighting of
experimental and prior information only requires estimates of probability functions 
for each source of information. 

The likelihood-based density modification approach has a second very
important advantage. This is that the derivatives of the likelihood functions with
respect to individual structure factors can be readily calculated in reciprocal space
by Fast Fourier Transform (FFT) based methods. As a consequence, density
modification simply becomes an optimization of a combined likelihood function by
adjustment of structure factors. This makes density modification a remarkably
simple but powerful approach, requiring only that suitable likelihood functions be
constructed for each aspect of prior knowledge that is to be incorporated.

The basic idea of the likelihood-based density modification procedure is that
there are two key kinds of information about the structure factors for a crystal of a
macromolecule. The first is the experimental phase and amplitude information,
which can be expressed in terms of a likelihood (or a log-likelihood function
$LL^{OBS}(F_h)$) for each structure factor $F_h$. The experimental probability distribution for
the structure factor, $p^{OBS}(F_h)$, is given by

$$p^{OBS}(F_h) = \exp\{LL^{OBS}(F_h)\} \qquad (1)$$

For reflections with accurately-measured amplitudes, the chief uncertainty in $F_h$ will
be in the phase, while for unmeasured or poorly-measured reflections, it will be in
both phase and amplitude.

The second kind of information about structure factors in this formulation is
the likelihood of the map resulting from the factors. For example, for most
macromolecular crystals, a set of structure factors $\{F_h\}$ that leads to a map with a
flat region corresponding to solvent is more likely to be correct than one that leads
to a map with uniform variation everywhere. This map likelihood function describes
the probability that the map obtained from a set of structure factors is compatible
with expectations:

$$p^{MAP}(F_h) = \exp\{LL^{MAP}(F_h)\} \qquad (2)$$

The two principal sources of information are then combined, along with any prior 
knowledge of the structure factors, to yield the likelihood of a particular set of 
structure factors: 

$$LL(\{F_h\}) = LL^{o}(\{F_h\}) + LL^{OBS}(\{F_h\}) + LL^{MAP}(\{F_h\}) \qquad (3)$$

where $LL^{o}(\{F_h\})$ includes any structure factor information that is known in advance,
such as the distribution of intensities of structure factors.

In order to maximize the overall likelihood function in Eq. (3), the change in
the map likelihood function in response to changes in structure factors must be
known. In the case of the map likelihood function, $LL^{MAP}(\{F_h\})$, there are two linked
relationships: the response of the likelihood function to changes in electron density,
and the changes in electron density as a function of changes in structure factors. In
principle, the likelihood of a particular map is a complicated function of the electron
density over the entire map. Furthermore, the value of any structure factor affects
the electron density everywhere in the map.

For simplification, a low-order approximation to the likelihood function for a
map is used instead of attempting to evaluate the function precisely. As Fourier
transformation is a linear process, each reflection contributes independently to the
electron density at a given point in the cell. Although the log-likelihood of the
electron density might have any form, it is expected that for sufficiently small
changes in structure factors, a first-order approximation to the log-likelihood function
would apply and each reflection would also contribute relatively independently to
changes in the log-likelihood function.

Consequently, a local approximation to the map likelihood function can be
constructed, neglecting correlations among different points in the map and between
reflections, expecting that it might describe with reasonable accuracy how the
likelihood function would vary in response to small changes in the structure factors.
By neglecting correlations among different points in the map, the log-likelihood for
the whole electron density map is written as the sum of the log-likelihood of the
densities at each point in the map, normalized to the volume of the unit cell and the
number of reflections used to construct it:

$$LL^{MAP}(\{F_h\}) \approx \frac{N_{REF}}{V} \int_V LL(\rho(x,\{F_h\}))\, d^3x \qquad (4)$$

where $N_{REF}$ is the number of independent reflections and V is the volume of the unit cell.
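
On a uniformly sampled map, the volume-normalized integral in Eq. (4) reduces to the mean of the per-point log-likelihood values, scaled by the number of reflections. The sketch below assumes such a uniform grid; the function name and the sample values are illustrative only.

```python
def map_log_likelihood(ll_values, n_ref):
    """Grid approximation of (N_REF / V) * integral over V of LL(rho(x)) d^3x.

    On a uniform grid every point represents the same volume V / n_points,
    so the integral divided by V is the arithmetic mean of the point values.
    """
    if not ll_values:
        raise ValueError("need at least one grid point")
    return n_ref * sum(ll_values) / len(ll_values)

# Eight grid points that each have a log-likelihood of -1, with 100 reflections.
example = map_log_likelihood([-1.0] * 8, 100)
```

Because the result depends only on the mean, refining the grid does not by itself change the scale of the map log-likelihood.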

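By treating each reflection as independently contributing to the likelihood
function, a local approximation to the log-likelihood of the density at each point
$LL(\rho(x,\{F_h\}))$ is written. This approximation is given by the sum over all reflections
of the first few terms of a Taylor's series expansion around the value obtained with
the starting structure factors $\{F_h^0\}$ used in a cycle of density modification,

$$LL(\rho(x,\{F_h\})) \approx LL(\rho(x,\{F_h^0\})) + \sum_h \Big[ \Delta F_{h,\parallel} \frac{\partial}{\partial F_{h,\parallel}} LL(\rho(x,\{F_h\})) + \frac{1}{2} \Delta F_{h,\parallel}^2 \frac{\partial^2}{\partial F_{h,\parallel}^2} LL(\rho(x,\{F_h\}))
+ \Delta F_{h,\perp} \frac{\partial}{\partial F_{h,\perp}} LL(\rho(x,\{F_h\})) + \frac{1}{2} \Delta F_{h,\perp}^2 \frac{\partial^2}{\partial F_{h,\perp}^2} LL(\rho(x,\{F_h\})) + \cdots \Big] \qquad (5)$$

where $\Delta F_{h,\parallel}$ and $\Delta F_{h,\perp}$ are the differences between $F_h$ and $F_h^0$ along the directions
$F_h^0$ and $iF_h^0$, respectively.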

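Combining Eqs. (4) and (5) results in an expression for the map log-likelihood
function,

$$LL^{MAP}(\{F_h\}) \approx LL^{MAP}(\{F_h^0\}) + \frac{N_{REF}}{V} \sum_h \Big[ \Delta F_{h,\parallel} \int_V \frac{\partial}{\partial F_{h,\parallel}} LL(\rho(x,\{F_h\}))\, d^3x
+ \frac{1}{2} \Delta F_{h,\parallel}^2 \int_V \frac{\partial^2}{\partial F_{h,\parallel}^2} LL(\rho(x,\{F_h\}))\, d^3x
+ \Delta F_{h,\perp} \int_V \frac{\partial}{\partial F_{h,\perp}} LL(\rho(x,\{F_h\}))\, d^3x
+ \frac{1}{2} \Delta F_{h,\perp}^2 \int_V \frac{\partial^2}{\partial F_{h,\perp}^2} LL(\rho(x,\{F_h\}))\, d^3x + \cdots \Big] \qquad (6)$$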

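The integrals in Eq. (6) can be rewritten in a form that is suitable for
evaluation by an FFT-based approach. Considering the first integral in Eq. (6), use
the chain rule to write,

$$\frac{\partial}{\partial F_{h,\parallel}} LL(\rho(x,\{F_h\})) = \frac{\partial}{\partial \rho(x)} LL(\rho(x,\{F_h\}))\, \frac{\partial \rho(x)}{\partial F_{h,\parallel}} \qquad (7)$$

and note that the derivative of $\rho(x)$ with respect to $F_{h,\parallel}$ for a particular index value
h is given by,

$$\frac{\partial}{\partial F_{h,\parallel}} \rho(x) = \frac{2}{V} \mathrm{Re}\left[ e^{i\phi_h - 2\pi i\, h \cdot x} \right] \qquad (8)$$

Now the first integral in Eq. (6) is rewritten in the form,

$$\int_V \frac{\partial}{\partial F_{h,\parallel}} LL(\rho(x,\{F_h\}))\, d^3x = \frac{2}{V} \mathrm{Re}\left[ e^{i\phi_h} a_h^{*} \right] \qquad (9)$$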



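where the complex number $a_h$ is a term in the Fourier transform of
$\frac{\partial}{\partial \rho(x)} LL(\rho(x,\{F_h\}))$,

$$a_h = \int_V \frac{\partial}{\partial \rho(x)} LL(\rho(x,\{F_h\}))\, e^{2\pi i\, h \cdot x}\, d^3x \qquad (10)$$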
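
The statement that the derivative of the integrated log-likelihood with respect to a structure factor amplitude is obtained from a single Fourier coefficient of the derivative of LL with respect to density can be checked numerically. The following is a minimal sketch under several assumptions that are not taken from the specification: a one-dimensional toy cell of length V sampled on a uniform grid, a direct summation in place of an FFT, Friedel-paired reflections h = 1, 2, 3 only, and an arbitrary illustrative log-likelihood LL(ρ) = -ρ²/2. All function names are hypothetical.

```python
import cmath
import math

def density(amps, phases, n_grid, volume=1.0):
    """Real 1-D density built from Friedel-paired reflections h = 1..len(amps)."""
    rho = []
    for j in range(n_grid):
        x = j / n_grid  # fractional coordinate of grid point j
        rho.append(sum((2.0 / volume) * a * math.cos(p - 2.0 * math.pi * (h + 1) * x)
                       for h, (a, p) in enumerate(zip(amps, phases))))
    return rho

def ll(rho):
    """Illustrative per-point log-likelihood (an assumption): favors flat density."""
    return -0.5 * rho * rho

def dll_drho(rho):
    """Derivative of the illustrative log-likelihood with respect to density."""
    return -rho

def integral_ll(amps, phases, n_grid, volume=1.0):
    """Grid approximation of the integral over V of LL(rho(x))."""
    return (volume / n_grid) * sum(ll(r) for r in density(amps, phases, n_grid, volume))

def a_coeff(h, amps, phases, n_grid, volume=1.0):
    """Fourier coefficient of dLL/drho at index h (direct sum standing in for an FFT)."""
    rho = density(amps, phases, n_grid, volume)
    return (volume / n_grid) * sum(dll_drho(r) * cmath.exp(2j * math.pi * h * j / n_grid)
                                   for j, r in enumerate(rho))

def gradient_parallel(h, amps, phases, n_grid, volume=1.0):
    """(2 / V) * Re[exp(i * phi_h) * conj(a_h)]: derivative along the current phase."""
    a_h = a_coeff(h, amps, phases, n_grid, volume)
    return (2.0 / volume) * (cmath.exp(1j * phases[h - 1]) * a_h.conjugate()).real

def gradient_finite_difference(h, amps, phases, n_grid, volume=1.0, eps=1e-5):
    """Central-difference derivative of the integral with respect to amplitude h."""
    up = list(amps)
    up[h - 1] += eps
    down = list(amps)
    down[h - 1] -= eps
    return (integral_ll(up, phases, n_grid, volume)
            - integral_ll(down, phases, n_grid, volume)) / (2.0 * eps)

amps = [1.0, 0.5, 0.25]
phases = [0.3, 1.7, -2.1]
analytic = gradient_parallel(2, amps, phases, n_grid=32)
numeric = gradient_finite_difference(2, amps, phases, n_grid=32)
```

With this quadratic toy log-likelihood the integral equals minus the sum of squared amplitudes, so the derivative with respect to the amplitude of h = 2 is -2 × 0.5 = -1, which both routes reproduce. In three dimensions the direct sum would be replaced by one FFT that yields every coefficient at once.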



In space groups other than P1, only a unique set of structure factors needs to be
specified to calculate an electron density map. Taking space group symmetry into
account, Eq. (9) can be generalized to read,

$$\int_V \frac{\partial}{\partial F_{h,\parallel}} LL(\rho(x,\{F_h\}))\, d^3x = \frac{2}{V} \sum_{h'} \mathrm{Re}\left[ e^{i\phi_{h'}} a_{h'}^{*} \right] \qquad (11)$$

where the indices h' are all indices equivalent to h due to space-group symmetry.

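A similar procedure is used to rewrite the second integral in Eq. (6), yielding
the expression,

$$\int_V \frac{\partial^2}{\partial F_{h,\parallel}^2} LL(\rho(x,\{F_h\}))\, d^3x = \frac{2}{V^2} \sum_{h',k'} \mathrm{Re}\left[ e^{i\phi_{h'}} e^{i\phi_{k'}} b_{h'+k'}^{*} + e^{i\phi_{h'}} e^{-i\phi_{k'}} b_{h'-k'}^{*} \right] \qquad (12)$$

where the indices h' and k' are each all indices equivalent to h due to space
group symmetry, and where the coefficients $b_h$ are again terms in a Fourier
transform, this time of the second derivative of the log-likelihood of the electron
density,

$$b_h = \int_V \frac{\partial^2}{\partial \rho(x)^2} LL(\rho(x,\{F_h\}))\, e^{2\pi i\, h \cdot x}\, d^3x \qquad (13)$$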

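The third and fourth integrals in Eq. (6) can be rewritten in a similar way
yielding the expressions,

$$\int_V \frac{\partial}{\partial F_{h,\perp}} LL(\rho(x,\{F_h\}))\, d^3x = \frac{2}{V} \sum_{h'} \mathrm{Re}\left[ i\, e^{i\phi_{h'}} a_{h'}^{*} \right] \qquad (14)$$

and

$$\int_V \frac{\partial^2}{\partial F_{h,\perp}^2} LL(\rho(x,\{F_h\}))\, d^3x = \frac{2}{V^2} \sum_{h',k'} \mathrm{Re}\left[ -e^{i\phi_{h'}} e^{i\phi_{k'}} b_{h'+k'}^{*} + e^{i\phi_{h'}} e^{-i\phi_{k'}} b_{h'-k'}^{*} \right] \qquad (15)$$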



The significance of Eqs. (4) through (15) is that there is now a simple
expression (Eq. (6)) describing how the map likelihood function $LL^{MAP}(\{F_h\})$ varies
when small changes are made in the structure factors. Evaluating this expression
requires only that the first and second derivatives of the log-likelihood of the
electron density be calculated with respect to electron density at each point in the
map (see Eq. (22) below) and that a Fast Fourier Transform (FFT) be carried out as
described by Teneyck, Acta Cryst. 33, pp. 486-492 (1977), incorporated by
reference. Furthermore, maximization of the (local) overall likelihood function (Eq.
(3)) becomes straightforward, as every reflection is treated independently. It
consists simply of adjusting each structure factor to maximize its contribution to the
approximation to the likelihood function through Eqs. (3)-(15).

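In practice, instead of directly maximizing the overall likelihood function, it is
used here to estimate the probability distribution for each structure factor, and then
to integrate this probability distribution over the phase (or phase and amplitude) of
the reflection to obtain a weighted mean estimate of the structure factor. Using Eqs.
(3)-(15), the probability distribution for an individual structure factor can be written
as,

$$\ln p(F_h) \approx LL^{o}(F_h) + LL^{OBS}(F_h)
+ \frac{2 N_{REF}}{V^2} \Delta F_{h,\parallel} \sum_{h'} \mathrm{Re}\left[ e^{i\phi_{h'}} a_{h'}^{*} \right]
+ \frac{N_{REF}}{V^3} \Delta F_{h,\parallel}^2 \sum_{h',k'} \mathrm{Re}\left[ e^{i\phi_{h'}} e^{i\phi_{k'}} b_{h'+k'}^{*} + e^{i\phi_{h'}} e^{-i\phi_{k'}} b_{h'-k'}^{*} \right]
+ \frac{2 N_{REF}}{V^2} \Delta F_{h,\perp} \sum_{h'} \mathrm{Re}\left[ i\, e^{i\phi_{h'}} a_{h'}^{*} \right]
+ \frac{N_{REF}}{V^3} \Delta F_{h,\perp}^2 \sum_{h',k'} \mathrm{Re}\left[ -e^{i\phi_{h'}} e^{i\phi_{k'}} b_{h'+k'}^{*} + e^{i\phi_{h'}} e^{-i\phi_{k'}} b_{h'-k'}^{*} \right] \qquad (16)$$

where, as above, the indices h' and k' are each all indices equivalent to h due to
space group symmetry, and the coefficients $a_h$ and $b_h$ are given in Eqs. (10) and
(13). Also, as before, $\Delta F_{h,\parallel}$ and $\Delta F_{h,\perp}$ are the differences between $F_h$ and $F_h^0$
along the directions $F_h^0$ and $iF_h^0$, respectively. All the quantities in Eq. (16) can be
readily calculated once a likelihood function for the electron density and its
derivatives are obtained (see Eq. (22) below).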
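
The step of integrating a probability distribution over phase to obtain a weighted mean structure factor can be sketched as follows. This is a minimal illustration, assuming the amplitude is known exactly so that only the phase is integrated, and assuming a caller-supplied function `log_prob(phi)` that stands in for the logarithm of the probability of each phase. The function name and the two example distributions are hypothetical.

```python
import cmath
import math

def weighted_mean_structure_factor(amplitude, log_prob, n_phase=360):
    """Probability-weighted mean of amplitude * exp(i * phi) over a uniform phase grid."""
    phases = [2.0 * math.pi * k / n_phase for k in range(n_phase)]
    logs = [log_prob(phi) for phi in phases]
    top = max(logs)  # subtract the maximum before exponentiating to avoid overflow
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return sum(w * amplitude * cmath.exp(1j * phi)
               for w, phi in zip(weights, phases)) / total

# No phase information: every phase is equally probable.
flat = weighted_mean_structure_factor(100.0, lambda phi: 0.0)

# Unimodal phase information centered on 1.0 radian.
peaked = weighted_mean_structure_factor(100.0, lambda phi: 5.0 * math.cos(phi - 1.0))
```

A flat distribution gives a mean of essentially zero, while a peaked one gives a mean whose phase sits at the peak and whose magnitude is reduced below the amplitude. The ratio of that magnitude to the amplitude plays the role of a figure of merit.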

A key step in likelihood-based density modification is the decision as to the
likelihood function for values of the electron density at a particular location in the
map. For the present purposes, an expression for the log-likelihood of the electron
density $LL(\rho(x,\{F_h\}))$ at a particular location x in a map is needed that depends on
whether the point satisfies any of a wide variety of conditions, such as being in the
protein or solvent region of the crystal, being at a certain location in a known
fragment of structure, or being at a certain distance from some other feature of the
map. Information can be incorporated on the environment of x by writing the
log-likelihood function as the log of the sum of conditional probabilities dependent on
the environment of x,

$$LL(\rho(x,\{F_h\})) = \ln\left[ p(\rho(x)|PROT)\, p_{PROT}(x) + p(\rho(x)|SOLV)\, p_{SOLV}(x) \right] \qquad (17)$$

where $p_{PROT}(x)$ and $p_{SOLV}(x)$ are the probabilities that x is in the protein or solvent
region, respectively. The probability that x is in the protein or solvent region is
estimated by a modification,
described in Terwilliger, Acta Cryst. D55, pp. 1863-1871 (1999), of the methods
described in Wang, Methods Enzymol. 115, pp. 90-112 (1985), and Leslie,
Proceedings of the Study Weekend organized by CCP4, pp. 25-32 (1988),
incorporated herein by reference. If there were more than just solvent and protein
regions that identified the environment of each point, then Eq. (17) could be
modified to include those as well.
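
The idea of taking the log of a sum of conditional probabilities weighted by the environment of x can be sketched for the two-environment case. The example assumes that only protein and solvent environments exist, so that the solvent probability is one minus the protein probability, and it uses two illustrative Gaussian densities that are not taken from the specification. All names are hypothetical.

```python
import math

def gaussian_pdf(rho, mean, sigma):
    """Normalized Gaussian probability density."""
    return math.exp(-0.5 * ((rho - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def log_likelihood_at_point(rho, p_protein, pdf_protein, pdf_solvent):
    """Log of the environment-weighted sum of conditional probabilities of the density."""
    if not 0.0 <= p_protein <= 1.0:
        raise ValueError("p_protein must lie in [0, 1]")
    p_solvent = 1.0 - p_protein  # assumption: protein and solvent are the only environments
    return math.log(p_protein * pdf_protein(rho) + p_solvent * pdf_solvent(rho))

def pdf_solvent(rho):
    """Assumed solvent distribution: narrow and centered on zero density."""
    return gaussian_pdf(rho, 0.0, 0.1)

def pdf_protein(rho):
    """Assumed protein distribution: broader and shifted to positive density."""
    return gaussian_pdf(rho, 0.3, 0.5)

# A point with near-zero density is more plausible if it is probably solvent.
mostly_solvent = log_likelihood_at_point(0.0, 0.1, pdf_protein, pdf_solvent)
mostly_protein = log_likelihood_at_point(0.0, 0.9, pdf_protein, pdf_solvent)
```

Additional environments would enter as further terms inside the logarithm, each with its own probability and conditional density.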

In developing Eqs. (3)-(15), the derivatives of the likelihood function for
electron density were intended to represent how the likelihood function changed
when small changes in one structure factor were made. Surprisingly, the likelihood
function that is most appropriate for the present invention is not a globally correct
one. Instead, it is a likelihood function that represents how the overall likelihood
function varies in response to small changes in one structure factor, keeping all
others constant. To see the difference, consider the electron density in the solvent
region of a macromolecular crystal. In an idealized situation with all possible
reflections included, the electron density might be exactly equal to a constant in this
region. The goal in using Eq. (16) is to obtain the relative probabilities for each
possible value of a particular unknown structure factor $F_h$. If all other structure
factors were exact, then the globally correct likelihood function for the electron
density (zero unless the solvent region is perfectly flat) would correctly identify the
correct value of the unknown structure factor.

Now suppose the phase information is imperfect. The solvent regions would
have a significant amount of noise, and the electron density value is no longer a
constant. If the globally correct likelihood function is used for the electron density, a
zero probability would be assigned to any value of the structure factor that did not
lead to an absolutely flat solvent region. This is clearly unreasonable, because all
the other (incorrect) structure factors are contributing noise that exists regardless of
the value of this structure factor.

This situation is very similar to the one encountered in structure refinement of
macromolecular structures where there is a substantial deficiency in the model.
The errors in all the other structure factors in the discussion correspond to the
deficiency in the macromolecular model in the refinement case. The appropriate
variance to use as a weighting factor in refinement includes the estimated model
error as well as the error in measurement. Similarly, the appropriate likelihood
function for electron density for use in the present method is one in which the
overall uncertainty in the electron density due to all reflections other than the one
being considered is included in the variance.

A likelihood function of this kind for the electron density can be developed
using a model in which the electron density due to all reflections but one is treated
as a random variable. See Terwilliger et al., Acta Cryst. D51, pp. 609-618 (1996),
incorporated herein by reference. Suppose that the true value of the electron
density at x was known and was given by $\rho_T$. Then consider that there are
estimates of all the structure factors, but that substantial errors exist in each one.
The expected value of the estimate of this electron density ($\rho_{OBS}$) obtained from
current estimates of all the structure factors will be given approximately by
$\langle \rho_{OBS} \rangle = \beta \rho_T$, and the expected value of the variance by
$\langle (\rho_{OBS} - \beta \rho_T)^2 \rangle = \sigma_{MAP}^2$.
The factor $\beta$ represents the expectation that the calculated value of $\rho$ will be
smaller than the true value. This is true for two reasons. One is that such an
estimate may be calculated using figure-of-merit weighted estimates of structure
factors, which will be smaller than the correct ones. The other is that phase error in
the structure factors systematically leads to a bias towards a smaller component of
the structure factor along the direction of the true structure factor.

A probability function for the electron density at a point x that is appropriate
for assessing the probabilities of values of the structure factor for one reflection can
now be written as,

$$p(\rho) = \exp\left\{ -\frac{(\rho - \beta \rho_T)^2}{2 \sigma_{MAP}^2} \right\} \qquad (18)$$

In a slightly more complicated case where the value of $\rho_T$ is not known exactly, but
rather has an uncertainty $\sigma_T$, Eq. (18) becomes,

$$p(\rho) = \exp\left\{ -\frac{(\rho - \beta \rho_T)^2}{2 (\beta^2 \sigma_T^2 + \sigma_{MAP}^2)} \right\} \qquad (19)$$

Finally, in the case where only a probability distribution $p(\rho_T)$ for $\rho_T$ is known, Eq.
(18) becomes,

$$p(\rho) = \int p(\rho_T) \exp\left\{ -\frac{(\rho - \beta \rho_T)^2}{2 \sigma_{MAP}^2} \right\} d\rho_T \qquad (20)$$
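
The case of an uncertain true density and the case of a known distribution for the true density are linked. If that distribution is itself Gaussian with width $\sigma_T$, integrating over it yields a Gaussian in $\rho$ whose variance is $\beta^2 \sigma_T^2 + \sigma_{MAP}^2$. The sketch below checks this numerically with a simple midpoint-rule integral. The parameter values are arbitrary, the function names are hypothetical, and only ratios are compared because the expressions are unnormalized.

```python
import math

def gaussian_case(rho, rho_t, sigma_t, beta, sigma_map):
    """Unnormalized Gaussian with variance beta^2 * sigma_t^2 + sigma_map^2."""
    variance = beta ** 2 * sigma_t ** 2 + sigma_map ** 2
    return math.exp(-((rho - beta * rho_t) ** 2) / (2.0 * variance))

def integrated_case(rho, prior, beta, sigma_map, lower, upper, n_steps=4000):
    """Midpoint-rule integral over rho_T of
    prior(rho_T) * exp(-(rho - beta * rho_T)^2 / (2 * sigma_map^2))."""
    step = (upper - lower) / n_steps
    total = 0.0
    for i in range(n_steps):
        rho_t = lower + (i + 0.5) * step
        total += prior(rho_t) * math.exp(-((rho - beta * rho_t) ** 2)
                                         / (2.0 * sigma_map ** 2))
    return total * step

center, sigma_t, beta, sigma_map = 0.4, 0.2, 0.7, 0.3

def prior(rho_t):
    """Assumed Gaussian distribution of the true density."""
    return math.exp(-((rho_t - center) ** 2) / (2.0 * sigma_t ** 2))

ratio_integrated = (integrated_case(0.5, prior, beta, sigma_map, -2.0, 3.0)
                    / integrated_case(0.1, prior, beta, sigma_map, -2.0, 3.0))
ratio_gaussian = (gaussian_case(0.5, center, sigma_t, beta, sigma_map)
                  / gaussian_case(0.1, center, sigma_t, beta, sigma_map))
```

The two ratios agree to within the integration error, which is why a distribution expressed as a sum of Gaussians can be carried through the integral term by term.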



Using Eqs. (19) and (20), a histogram-based approach (Goldstein et al., Acta
Cryst. D54, pp. 1230-1244 (1998)) can be used to develop likelihood functions for
the solvent region of a map and for the macromolecule-containing region of a map.
The approach is simple. The probability distribution for true electron density in the
solvent or macromolecule regions of a crystal structure is obtained from an analysis
of model structures and represented as a sum of Gaussian functions of the form,

$$p(\rho_T) = \sum_k w_k \exp\left\{ -\frac{(\rho_T - c_k)^2}{2 \sigma_k^2} \right\} \qquad (21)$$

where the coefficients $w_k$ are normalized so that the integral of $p(\rho_T)$ over all
$\rho$ is unity.



The coefficients $c_k$, $\sigma_k$, and $w_k$ are obtained as follows. A model of a protein
structure is used to calculate theoretical structure factors for a crystal of that protein
structure. Exemplary structures may be obtained from the Protein Data Bank
(H.M. Berman et al., The Protein Data Bank. Nucleic Acids Research 28, pp. 235-242,
2000), which contain space group, cell dimensions and angles, and a list of
coordinates, atom types, occupancies, and atomic displacement parameters. The
model may be chosen to be similar in size, resolution of the data, and overall atomic
displacement factors to the experimental protein structure to be analyzed, but this is
not essential to the process. The resolution of the calculated data and the average
atomic displacement parameter may be adjusted to match those of the protein
structure to be analyzed. Alternatively, a standardized resolution such as 3
Angstrom units and unadjusted atomic displacement parameters may be used, as in
the examples given below. The theoretical structure factors for the model are then
used to calculate an electron density map.



The electron density map is then divided into "protein" and "solvent" regions
in the following way. All points in the map within a specified distance (typically 2.5
Angstrom units) of an atom in the model are designated "protein" and all others are
designated "solvent". The next steps are carried out separately for "protein" and
"solvent" regions of the electron density map. A histogram of the numbers of points
in the protein or solvent region of the electron density map falling into each possible
range of electron densities is calculated. The histogram is then normalized so that
the sum of all histogram values is equal to unity. Finally, the coefficients
$c_k$, $\sigma_k^2$, and $w_k$ are obtained by least-squares fitting of Equation (21) to the
normalized histograms. One set of coefficients is obtained for the "protein" region,
another for the "solvent" region.

If the values of $\beta$ and $\sigma_{MAP}$ are known for an experimental map with
unknown errors, but identified solvent and protein regions, the probability
distribution for electron density in each region of the map can be written
approximately from Eq. (19) as,

$$p(\rho) \approx \sum_k w_k \exp\left\{ -\frac{(\rho - \beta c_k)^2}{2 (\beta^2 \sigma_k^2 + \sigma_{MAP}^2)} \right\} \qquad (22)$$

with the appropriate values of $\beta$ and $\sigma_{MAP}$ and separate values of $c_k$, $\sigma_k^2$, and $w_k$
for protein and solvent regions. In practice, the values of $\beta$ and $\sigma_{MAP}$ are
estimated by a least-squares fitting of the probability distributions for protein and
solvent regions given in Eq. (22) to the ones found in the protein and solvent
regions in the experimental map.
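
Because the density distribution is a weighted sum of Gaussians, the first and second derivatives of its logarithm with respect to density, which the Fourier-based expressions require, have closed forms. The following sketch assumes the distribution is a sum over k of $w_k$ times a Gaussian centered at $\beta c_k$ with variance $\beta^2 \sigma_k^2 + \sigma_{MAP}^2$. The function names and the component values are hypothetical.

```python
import math

def gaussian_sum(rho, components, beta, sigma_map):
    """Sum over (w_k, c_k, sigma_k) of
    w_k * exp(-(rho - beta*c_k)^2 / (2 * (beta^2 * sigma_k^2 + sigma_map^2)))."""
    total = 0.0
    for weight, center, sigma in components:
        variance = beta ** 2 * sigma ** 2 + sigma_map ** 2
        total += weight * math.exp(-((rho - beta * center) ** 2) / (2.0 * variance))
    return total

def log_likelihood_derivatives(rho, components, beta, sigma_map):
    """Return (d ln p / d rho, d^2 ln p / d rho^2) from the analytic p, p' and p''."""
    p = dp = d2p = 0.0
    for weight, center, sigma in components:
        variance = beta ** 2 * sigma ** 2 + sigma_map ** 2
        offset = rho - beta * center
        term = weight * math.exp(-offset * offset / (2.0 * variance))
        p += term
        dp += -offset / variance * term
        d2p += (offset * offset / variance ** 2 - 1.0 / variance) * term
    first = dp / p
    second = d2p / p - first ** 2  # (ln p)'' = p''/p - (p'/p)^2
    return first, second

# Hypothetical two-component description: (w_k, c_k, sigma_k).
components = [(0.7, 0.0, 0.1), (0.3, 0.5, 0.3)]
first, second = log_likelihood_derivatives(0.15, components, beta=0.8, sigma_map=0.2)
```

Evaluating these two derivatives at every grid point gives the real-space arrays whose Fourier transforms supply the coefficients used in the reciprocal-space expressions.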

This fitting is carried out by first constructing separate histograms of values of
electron density in the protein and solvent regions defined by the methods
described in Wang, Methods Enzymol. 115, pp. 90-112 (1985) and Leslie,
Proceedings of the Study Weekend, organized by CCP4, pp. 25-32 (1988),
incorporated by reference. Next, the histograms are normalized so that the sum,
over all values of electron density, of the values in each histogram is unity. In this
way the histograms represent the probability that each value of electron density is
observed. Then the values of $\beta$ and $\sigma_{MAP}$ in Eq. (22) are adjusted to minimize the
squared difference between the values of the probabilities calculated from Eq. (22)
and the observed values from the analysis of the histogram. This procedure has
the advantage that the scale of the experimental map does not have to be
accurately determined. Then Eq. (22) is used with the refined values of $\beta$ and
$\sigma_{MAP}$ as the probability function for electron density in the corresponding region
(solvent or macromolecule) of the map.
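
The fitting procedure just described can be sketched in two steps: build a histogram normalized to unit sum, then search for the $\beta$ and $\sigma_{MAP}$ that minimize the squared difference between model and observed bin probabilities. The sketch makes assumptions that are not in the specification: the model values are rescaled to sum to unity over the bins so that they are comparable with the histogram, an exhaustive grid search stands in for a proper least-squares optimizer, and the "observed" values are synthetic. All names are hypothetical.

```python
import math

def normalized_histogram(values, lower, upper, n_bins):
    """Histogram of values on [lower, upper), scaled so that the bins sum to unity."""
    counts = [0] * n_bins
    width = (upper - lower) / n_bins
    for value in values:
        if lower <= value < upper:
            counts[min(int((value - lower) / width), n_bins - 1)] += 1
    total = sum(counts)
    if total == 0:
        raise ValueError("no values fall inside the histogram range")
    return [count / total for count in counts]

def model_probabilities(bin_centers, components, beta, sigma_map):
    """Gaussian-sum model at the bin centers, rescaled to unit sum (an assumption)."""
    raw = [sum(w * math.exp(-((rho - beta * c) ** 2)
                            / (2.0 * (beta ** 2 * s ** 2 + sigma_map ** 2)))
               for w, c, s in components)
           for rho in bin_centers]
    total = sum(raw)
    return [value / total for value in raw]

def fit_beta_sigma(observed, bin_centers, components, beta_grid, sigma_grid):
    """Exhaustive least-squares search over candidate beta and sigma_map values."""
    best = None
    for beta in beta_grid:
        for sigma_map in sigma_grid:
            model = model_probabilities(bin_centers, components, beta, sigma_map)
            residual = sum((m - o) ** 2 for m, o in zip(model, observed))
            if best is None or residual < best[0]:
                best = (residual, beta, sigma_map)
    return best[1], best[2]

# Synthetic check: generate "observed" probabilities with known parameters, then refit.
bin_centers = [-1.0 + 0.05 * (i + 0.5) for i in range(60)]
components = [(1.0, 0.5, 0.2)]  # hypothetical (w_k, c_k, sigma_k)
grid = [i / 10 for i in range(1, 11)]
observed = model_probabilities(bin_centers, components, 6 / 10, 3 / 10)
fitted_beta, fitted_sigma = fit_beta_sigma(observed, bin_centers, components, grid, grid)
```

Rescaling both sides to unit sum means only the shape of the distribution is fitted, which is in the spirit of the remark that the scale of the experimental map does not have to be accurately determined.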

The process discussed above is more particularly shown in Figures 1 and 2. 
The basic process of maximum-likelihood density modification has two parts. In the 
first part, the characteristics of model electron density map(s) are obtained (Figure 
1). These will typically be the same or similar for many different applications of the 
algorithm. In the second part (Figure 2), a particular set of structure factors has 
typically been obtained using experimental measurements on a crystal. This set of 
structure factors can be directly used to calculate an electron density map. Due to 
uncertainties in measurement, the electron density map is imperfect. In this second
part, a set of structure factors (phases and amplitudes) is found that is consistent
with experimental measurements of those structure factors, and that, when used to
calculate an electron density map, leads to an electron density that has
characteristics similar to those obtained from the model electron density map(s). A
likelihood-based approach is used to find this optimal set of structure factors.

Figure 1 shows a process for obtaining characteristics from model electron 
density maps to use in the above equations. First, a model protein structure 
obtained by X-ray crystallography is chosen 10. The model is used to 
10 conventionally calculate an electron density map 12. The electron density map is 
segmented into "protein" and "solvent" regions 14, where the protein region 
contains all points within a selected proximity to an atom in the model. Histograms
of electron density are obtained 16 for "protein" and "solvent" regions. For protein
and solvent regions, coefficients for the Gaussian function formed by Eq. (21) are
found so that Eq. (21) is optimally fitted 18 to the histogram for that region. Eq.
(21), with the fitted coefficients, is output 22 as the analytical description of the
electron density distribution in the protein or solvent region for this model structure.

Figure 2 depicts the process for finding the optimal set of structure factors for
a crystal consistent with experimental measurements and resulting in an
electron density map having characteristics expected from the model structure. The
inputs are (1) the analytical descriptions of electron density distributions (Eq. 21) for
model solvent and protein regions output from the process shown in Figure 1; (2)
the fraction $f_{solvent}$ of the crystal that is in the "solvent" region; (3) the space group
and cell parameters of the crystal; and (4) the experimental measurements of
structure factors (phases and amplitudes) and their associated uncertainties.

The overall process steps for estimating the probability that the electron
density at each point in the map is correct are: (1) obtaining probability distributions
for electron density for the protein and solvent regions of the current electron
density map; (2) estimating the probability that the electron density at each point in
the map is correct; (3) evaluating how the probabilities would change if the electron
density at each point in the map changed; (4) using a Fourier Transform to evaluate
how the overall likelihood of the electron density map would change if one
crystallographic structure factor changed; (5) combining the likelihood of the map
with the likelihood of having observed the experimental data, as a function of each
crystallographic structure factor; and (6) deriving a new probability distribution for
each crystallographic structure factor. Steps (1) through (6) are then iterated until
no substantial further changes in structure factors are obtained.

The process for finding structure factors that are consistent with experiments
and that result in an electron density map with expected characteristics is shown in
Figure 2. The current best estimates of structure factors are used to calculate 32
an electron density map. If there is uncertainty in amplitude or phase, the weighted
mean structure factor is ordinarily used, where all possible amplitudes and phases
are weighted by their relative probabilities. The electron density map is segmented
into protein and solvent regions as described by Wang, Methods Enzymol. 115,
pp. 90-112 (1985) and Leslie, Proceedings of the Study Weekend organized by
CCP4, pp. 25-32 (1988), incorporated by reference. The analytical descriptions of
electron density distributions for model protein and solvent regions are fitted by
least-squares to the observed electron density distributions in the protein and
solvent regions in this electron density map using the factors β and σ²_MAP, where
the same values of β and σ²_MAP are used for both protein and solvent regions.

Eq. (22), with the values of coefficients c_k, σ_k, and w_k for protein and solvent
regions obtained from fitting Eq. (21) to the model electron density from the process
shown in Figure 1, and with the values of β and σ²_MAP obtained above, now is an
analytical description of a probability distribution for electron density in protein or
solvent regions of the electron density map. The derivatives of Eq. (22) with respect
to electron density (ρ) are obtained by standard procedures.

The probability of the electron density at each point in the protein or solvent
regions of the current map is obtained 34 from Eq. (22). The overall log-likelihood
of this map is calculated from the sum of the logarithms of these probabilities. The
first and second derivatives with respect to electron density of the probability
distributions for each point are calculated 36 to evaluate how the probability at
each point would change if the electron density at each point in the map were
changed.
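
By way of illustration only, steps 34 and 36 might be sketched as follows in Python.
The sketch assumes that Eq. (22) is a sum of Gaussians of the form
p(ρ) = Σ_k w_k exp[-(ρ - β c_k)² / (2(β² σ_k² + σ²_MAP))]; that form, the function
names, and the data layout are assumptions of this sketch. It returns derivatives of
the logarithm of the probability, which is the quantity that enters a log-likelihood.

```python
import math


def gaussian_sum(rho, terms, beta, var_map):
    """Value and first two derivatives (with respect to rho) of the assumed
    sum-of-Gaussians form. `terms` is a list of (w_k, c_k, sigma_k) tuples."""
    p = dp = d2p = 0.0
    for w, c, sigma in terms:
        mean = beta * c
        var = beta * beta * sigma * sigma + var_map
        delta = rho - mean
        e = w * math.exp(-delta * delta / (2.0 * var))
        p += e
        dp += -e * delta / var
        d2p += e * (delta * delta / (var * var) - 1.0 / var)
    return p, dp, d2p


def log_prob_derivatives(rho, terms, beta, var_map):
    """Return ln p, d(ln p)/d(rho), and d2(ln p)/d(rho)2 at one density value."""
    p, dp, d2p = gaussian_sum(rho, terms, beta, var_map)
    first = dp / p
    second = d2p / p - first * first
    return math.log(p), first, second


def map_log_likelihood(densities, regions, terms_by_region, beta, var_map):
    """Sum of ln p over all map points; each point uses its own region's terms."""
    total = 0.0
    for rho, region in zip(densities, regions):
        log_p, _, _ = log_prob_derivatives(rho, terms_by_region[region], beta, var_map)
        total += log_p
    return total
```

The weights w_k are used as supplied, so the sum is normalized only if the fitted
coefficients make it so.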

An FFT is used to calculate 38, for each structure factor, how the overall
log-likelihood of the map would change if that structure factor were changed. Then, the
log-likelihood of the map as a function of all possible values of each structure factor
is estimated 42 from a Taylor series expansion of the log-likelihood of the map.
This provides a log-likelihood estimate of any value of each structure factor as the
sum of the log-likelihood of the resulting map with the log-likelihood of having
observed the experimental data given that value.
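
By way of illustration only, the following Python sketch shows a second-order Taylor
estimate of the change in map log-likelihood for a one-dimensional map on n grid
points. It assumes the convention ρ(x) = (1/n) Σ_h F_h exp(-2πi h x / n) with Friedel
pairs F_-h = conj(F_h), uses a naive discrete Fourier transform where an FFT would
ordinarily be used, and requires that 2h is not a multiple of n. The function names
and the one-dimensional setting are assumptions of this sketch.

```python
import cmath
import math


def dft(values, k):
    """k-th coefficient of a naive discrete Fourier transform of a real map."""
    n = len(values)
    return sum(v * cmath.exp(-2j * math.pi * k * x / n) for x, v in enumerate(values))


def density(F, n):
    """1-D map from a dict {h: F_h} that contains both members of each Friedel pair."""
    return [
        sum(f * cmath.exp(-2j * math.pi * h * x / n) for h, f in F.items()).real / n
        for x in range(n)
    ]


def map_log_likelihood(F, n, logp):
    """Sum over grid points of the log-probability of the density at that point."""
    return sum(logp(r) for r in density(F, n))


def taylor_delta_ll(F, n, h, dF, dlogp, d2logp):
    """Second-order estimate of the change in map log-likelihood when F[h] changes
    by dF and F[-h] by conj(dF). dlogp and d2logp are the first and second
    derivatives of the log-probability with respect to density."""
    rho = density(F, n)
    slope = [dlogp(r) for r in rho]    # first-derivative map
    curve = [d2logp(r) for r in rho]   # second-derivative map
    first = (2.0 / n) * (dF * dft(slope, h)).real
    second = (abs(dF) ** 2 * dft(curve, 0).real
              + (dF * dF * dft(curve, 2 * h)).real) / n ** 2
    return first + second
```

Only Fourier coefficients of the two derivative maps are needed, so one transform
of each map serves every structure factor.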

The new estimate 44 of the logarithm of the probability that a structure factor
has a particular value is obtained by adding together the log-likelihood of the map
for that value of the structure factor and the log-likelihood of observing the
experimental value of the structure factor. The exponentiation of these values is the
probability of each possible value of a structure factor and is used to obtain a new
weighted estimate of the structure factor. The new estimate of the structure factor
is then returned to step 32 to begin a new iteration with a revised electron density
map.
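
By way of illustration only, step 44 might be sketched as follows in Python for a
single structure factor whose amplitude is taken as known and whose phase is sampled
at discrete values. The fixed amplitude, the phase sampling, and the function name
are assumptions of this sketch.

```python
import cmath
import math


def combine_and_average(amplitude, phases_deg, ll_map, ll_obs):
    """Add the map and experimental log-likelihoods at each sampled phase,
    exponentiate and normalize, and return (probabilities, weighted structure
    factor, figure of merit)."""
    log_post = [a + b for a, b in zip(ll_map, ll_obs)]
    top = max(log_post)  # subtract the maximum before exponentiating for stability
    weights = [math.exp(value - top) for value in log_post]
    total = sum(weights)
    probs = [w / total for w in weights]
    centroid = sum(
        p * cmath.exp(1j * math.radians(phi)) for p, phi in zip(probs, phases_deg)
    )
    return probs, amplitude * centroid, abs(centroid)
```

The magnitude of the probability-weighted phase factor serves as the figure of
merit: it approaches 1 when one phase dominates and 0 when all phases are equally
probable.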

To evaluate the utility of maximum-likelihood density modification as
described here, the process was applied to both model and real data. The first set
of tests consisted of a set of phases constructed from a model with 32%-68% of the
volume of the unit cell taken up by protein. The cell was in space group P2₁2₁2
with cell dimensions of a = 94, b = 80, c = 43 Å and one molecule in the asymmetric
unit, and was based on 6906 model data from ∞ to 3.0 Å calculated from
coordinates from a dehalogenase enzyme from Rhodococcus species ATCC 55388
(ATCC, 1992), except that some of the atoms were not included to vary the fraction
of solvent in the unit cell. Phases with simulated errors were generated by adding
phase errors to yield an average value of the cosine of the phase error (i.e., the true
figure of merit of the phasing) of ⟨cos(Δφ)⟩ = 0.42 for acentric and 0.39 for centric
reflections.
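
By way of illustration only, the mean cosine of the phase error might be computed
as in the following Python sketch. The function name and the unweighted average
over reflections are assumptions of this sketch.

```python
import math


def mean_cos_phase_error(true_phases_deg, estimated_phases_deg):
    """Unweighted mean of cos(estimated phase - true phase), phases in degrees."""
    if len(true_phases_deg) != len(estimated_phases_deg) or not true_phases_deg:
        raise ValueError("phase lists must be non-empty and of equal length")
    cosines = [
        math.cos(math.radians(est - true))
        for true, est in zip(true_phases_deg, estimated_phases_deg)
    ]
    return sum(cosines) / len(cosines)
```

A value of 1 corresponds to perfect phases and a value near 0 to random phases.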

Analyses were done using conventional real-space solvent flattening and
reciprocal-space solvent flattening (Terwilliger, Acta Cryst. D55, pp. 1863-1871
(1999), incorporated by reference), as well as the maximum-likelihood approach.
Both real-space and reciprocal-space solvent flattening improved the quality of
phasing considerably. The real-space density modification included both solvent
flattening and histogram matching to be as comparable as possible to the
maximum-likelihood density modification according to the present invention.

Table I shows the quality of phases obtained after each method for density
modification was applied to this model case. In all cases, maximum-likelihood
density modification of this map resulted in phases with an effective figure of merit
(⟨cos(Δφ)⟩) higher than any of the other methods. When the fraction of solvent in
the model unit cell was 50%, for example, maximum-likelihood density modification
yielded an effective figure of merit of 0.83, while real-space solvent flattening and
histogram matching resulted in an effective figure of merit of 0.62 and
reciprocal-space solvent flattening yielded 0.67.

TABLE I

Fraction        Starting     Real Space   Reciprocal Space   Maximum Likelihood
Protein (%)     ⟨cos(Δφ)⟩    ⟨cos(Δφ)⟩    ⟨cos(Δφ)⟩          ⟨cos(Δφ)⟩

32              0.41         0.64         0.85               0.87
42              0.40         0.62         0.67               0.83
50              0.41         0.54         0.56               0.77
68              0.42         0.48         0.41               0.53

The utility of maximum-likelihood density modification was also compared
with real-space density modification and with reciprocal-space solvent flattening
using experimental multiwavelength anomalous diffraction (MAD) data on initiation
factor 5A (IF-5A). IF-5A crystallizes in space group I4 with cell dimensions of
a = 114, b = 114, c = 33 Å, one molecule in the asymmetric unit, and a solvent
content of about 60%. The structure was solved using MAD phasing based on three
selenium atoms in the asymmetric unit at a resolution of 2.2 Å. For purposes of
testing density modification methods, only one of the three selenium sites was used
in phasing here, resulting in a starting map with a correlation coefficient to the map
calculated using the final refined structure of 0.37.

Figures 3A-D show sections through electron density maps obtained after
real-space density modification using solvent flattening and histogram matching and
after maximum-likelihood density modification:

Figure 3A is an electron density map from SOLVE, calculated using only one
substituted selenium atom;

Figure 3B is an electron density map determined from a model structure,
calculated from an atomic model of the protein;

Figure 3C is an electron density map determined using the process of the
present invention (RESOLVE);

Figure 3D is an electron density map calculated using the software program
"dm" (K. Cowtan, "dm: An automated procedure for phase improvement by density
modification," Joint CCP4 and ESF-EACBM Newsletter on Protein Crystallography
31, pp. 34-38 (1994)).

As anticipated, the "dm"-modified map is improved over the starting map and
has a correlation coefficient of 0.65. The maximum-likelihood modified map is even
more substantially improved with a correlation coefficient to the map based on a
refined model of 0.79.
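
By way of illustration only, a correlation coefficient between two maps sampled on
the same grid might be computed as in the following Python sketch. The use of the
ordinary Pearson correlation over all grid points, and the function name, are
assumptions of this sketch.

```python
import math


def map_correlation(map_a, map_b):
    """Pearson correlation coefficient between two maps given as equal-length
    sequences of density values on the same grid."""
    n = len(map_a)
    if n == 0 or n != len(map_b):
        raise ValueError("maps must be non-empty and of equal length")
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("correlation is undefined for a constant map")
    return cov / math.sqrt(var_a * var_b)
```

Because the coefficient is unchanged by offsetting or rescaling either map, it
compares map features rather than absolute density levels.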

While the above demonstration considered only two sources of expected
electron density distributions (probability distributions for solvent regions and for
protein-containing regions), the methods can be applied directly to a wide variety of
sources of information. For example, any source of information about the expected
electron density at a particular point in the unit cell that can be written in a form such
as the one in Eq. (22) can be used in the procedure to describe the likelihood that a
particular value of electron density is consistent with expectation.

Sources of expected electron density information that are especially suitable
for application to the present method include non-crystallographic symmetry and the
knowledge of the location of fragments of structure in the unit cell. In the case of
non-crystallographic symmetry, the probability distribution for electron density at
one point in the unit cell can be written using Eq. (22) with a value of ρ_T equal to
the weighted mean at all non-crystallographically equivalent points in the cell. The
value of σ_T can be calculated based on their variances and the value of σ_MAP. In
the case of knowledge of locations of fragments in the unit cell, this knowledge can
be used to calculate estimates of the electron density distribution for each point in
the neighborhood of the fragment. These electron density distributions can then, in
turn, be used as described above to estimate ρ_T and σ_T in this region.

An iterative process could be developed in which fragment locations are 
identified by cross-correlation or related searches, density modification is applied, 
and additional searches are carried out to further generate a model for the electron 
density. Such a process could potentially even be used to construct a complete 
probabilistic model of a macromolecular structure using structure factor estimates
obtained from molecular replacement with fragments of macromolecular structures 
as a starting point. 

In all these cases, the electron density information could be included in much
the same way as the probability distributions that are used herein for the solvent
and protein regions of maps. In each case, the key is an estimate of the probability
distribution for electron density at a point in the map that contains some information
that restricts the likely values of electron density at that point. The procedure could
be further extended by having probability distributions describing the likelihood that
a particular point in the unit cell is within a protein region, within a solvent region,
within a particular location in a fragment of protein structure, within a
non-crystallographically related region, and so on. These probability distributions could
be overlapping or non-overlapping. Then, for each category of points, the
probability distribution for electron density within that category could be formulated
as in Eq. (22) and the method of the present invention applied.

This process extends reciprocal-space solvent flattening in two important
ways. One is that the expected electron density distribution in the non-solvent
region is included in the calculations, and a formalism for incorporating information
about the electron density map from a wide variety of sources is developed. The
second is that the probability distribution for the electron density is calculated using
Eq. (22) for both solvent and non-solvent regions and values of the scaling
parameter β and the map uncertainty σ_MAP are estimated by fitting model and
observed electron density distributions. This fitting process makes the whole
procedure very robust with respect to scaling of the experimental data, which
otherwise would have to be very accurate in order that the model electron density
distributions be applicable.

The foregoing description of the invention has been presented for purposes
of illustration and description and is not intended to be exhaustive or to limit the
invention to the precise form disclosed, and obviously many modifications and
variations are possible in light of the above teaching. The embodiments were
chosen and described in order to best explain the principles of the invention and its
practical application to thereby enable others skilled in the art to best utilize the
invention in various embodiments and with various modifications as are suited to
the particular use contemplated. It is intended that the scope of the invention be
defined by the claims appended hereto.



